PAL Group|Pattern Analysis and Learning Group

CASIA-BiRViT1K: Bilingual Road scene Video Text Dataset

1. Introduction

The Bilingual Road scene Video Text Dataset (BiRViT1K) was constructed by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). It contains 1,000 videos: 300 Chinese videos, 300 English videos and 400 bilingual videos. We annotated a total of 64,001 frames containing 806,011 text instances at the line level; every text instance is labeled with a quadrilateral, a transcript and a tracking identification (ID). We randomly selected 70% of the videos of each type as the training set and kept the rest as the test set, so the training set contains 44,808 frames from 700 videos and the test set contains 19,193 frames from 300 videos. Fig. 1 shows some images of different scenes in this dataset.

Fig. 1 Some images of different scenes in BiRViT1K.

CASIA-BiRViT1K_part01.rar

CASIA-BiRViT1K_part02.rar

CASIA-BiRViT1K_part03.rar

CASIA-BiRViT1K_part04.rar

CASIA-BiRViT1K_part05.rar

CASIA-BiRViT1K_part06.rar

CASIA-BiRViT1K_part07.rar

CASIA-BiRViT1K_part08.rar

2. Annotations

We annotate the text instances in the videos, covering Chinese characters, English letters, Arabic numerals and common symbols (e.g. commas, periods and spaces). The dataset uses the quadrilateral annotation format: for each text instance, the label includes the coordinates of the four corners of the text box, the transcript and the tracking identification (ID). If a text instance is hard to recognize or most of its area is truncated, we record its transcript as "###". Fig. 2 shows some annotations of video frames. As shown in Fig. 1 and Fig. 2, the text instances in our dataset are small in scale and diverse in form (license plates, shop names, traffic signs, etc.), which makes the dataset more challenging.
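
As a minimal sketch of how one such label could be held in memory, the Python snippet below models a text instance with its four corners, transcript and tracking ID, and treats the "###" transcript as a marker for instances that may be skipped during recognition. The class name `TextInstance`, its field names, the helper `group_by_track` and the choice to store the tracking ID as a string are all illustrative assumptions; they are not part of the released dataset tools.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Transcript recorded for hard-to-recognize or heavily truncated text.
IGNORE_TRANSCRIPT = "###"


@dataclass
class TextInstance:
    """One annotated text line in one frame (field names are hypothetical)."""
    corners: List[Tuple[float, float]]  # four (x, y) corners of the quadrilateral
    transcript: str
    track_id: str  # assumed to be kept as a string; the real type may differ

    @property
    def is_ignored(self) -> bool:
        """True when the transcript is the "###" placeholder."""
        return self.transcript == IGNORE_TRANSCRIPT


def group_by_track(frames: Dict[int, List[TextInstance]]) -> Dict[str, List[int]]:
    """Map each tracking ID to the sorted frame indices in which it appears."""
    tracks: Dict[str, List[int]] = defaultdict(list)
    for frame_idx in sorted(frames):
        for inst in frames[frame_idx]:
            tracks[inst.track_id].append(frame_idx)
    return dict(tracks)
```

Grouping by tracking ID in this way gives, for each text line, the frames over which it is visible, which is the information a tracking evaluation would typically need.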

Fig. 2 The annotations of video frames.

3. Dataset Format

We provide two label formats:
(1) A txt file is provided for each image. Each line represents one text instance and gives its corner coordinates, text content and ID, separated by '\t':
"x1,y1,x2,y2,x3,y3,x4,y4 text ID"
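
A line in the txt format of (1) could be read with a sketch like the one below. It assumes that the eight coordinates are comma-separated within the first field, that the three fields are separated by tab characters, and that the transcript itself contains no tab; the function name `parse_annotation_line` is hypothetical and not part of the dataset release.

```python
from typing import List, Tuple


def parse_annotation_line(line: str) -> Tuple[List[Tuple[float, float]], str, str]:
    """Split one label line into (corners, text, track_id).

    Assumed layout: "x1,y1,x2,y2,x3,y3,x4,y4<TAB>text<TAB>ID".
    """
    line = line.rstrip("\r\n")
    # Take the first tab-separated field as coordinates and the last as the ID,
    # so commas or spaces inside the transcript are preserved.
    coords_field, rest = line.split("\t", 1)
    text, track_id = rest.rsplit("\t", 1)

    values = [float(v) for v in coords_field.split(",")]
    if len(values) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(values)}")

    # Pair up (x1, y1), (x2, y2), (x3, y3), (x4, y4).
    corners = list(zip(values[0::2], values[1::2]))
    return corners, text, track_id
```

The tracking ID is returned as a string here because its exact type is not stated; convert it as needed once the actual files have been inspected.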
(2) A json file is provided for the training set and for the test set respectively. The format of the annotation file is as follows: